class: center, middle, inverse, title-slide

# PLSC30500, Fall 2022
## Week 1. Introduction and programming, pt. 1

---

# This course

- Instructors: Andy Eggers and Molly Offer-Westort
- TA: Oscar Cuadros
- part of a sequence:
    - Intro to Quant Soc Sci **(this course)** (fall)
    - Causal Inference (winter)
    - Linear Models (spring)

---

# Our objectives

- give a strong foundation for further study
- give a taste of what is fun about quantitative social science
    - working with data, visualizing relationships
    - mathematical rigor and clarity
    - thinking about causality and identification

--

Week by week:

- Programming (lecture week 1, several labs, every homework)
- Probability (week 2)
- Summarizing distributions (week 3)
- Causality and identification (week 4)
- Estimation (weeks 5 and 7)
- Inference (weeks 6 and 8)
- Presentations of data analysis projects (week 9)

---

# Expectations about background

Very useful (but not required) to have some exposure to some of

- math (semi-recently)
- probability & statistics
- econometrics/regression modeling
- programming

--

If you have very little exposure to something above, you will have to work harder on that.

If you have lots of exposure to all of the above, we hope you can still learn something.

---

# Expectations for the course

- Read the syllabus
- Prepare for class: attempt the main reading (usually Aronow & Miller); start with an easier reading if necessary
- If you are stuck on reading/assignments:
    1. Use Google first.
    1. Ask your question on our private StackOverflow (https://stackoverflow.com/c/uchicagopolmeth)
    1. Or if you're brave, ask on the real StackOverflow (https://stackoverflow.com/) if it's about `R` or CrossValidated (https://stats.stackexchange.com/) if it's about stats.
- If you are confused in class, ask a question

Please also *answer* questions on our private StackOverflow.

If you need to email one of us, please email both of us.
<!--
---

# What we do with data

We'll get started on working with data, and we'll also get started on thinking critically about how you use data to answer questions.

What data would you need to make the argument in the article below?

---

# Inferential questions

- What can the data you *do* have tell you about data you *don't* have?
- What data would you need to answer questions about *what would have happened*?
- What can we say about our *uncertainty* about estimates or predictions?
-->

---

# Assessment

- 40% problem sets (8 in all)
- 30% independent data analysis project (presentation and report)
- 20% in-class midterm on October 28
- 10% class participation

???

Brief overview of data analysis project

---

# Technical setup

You should have already done the following:

1. Install `R` from https://cran.rstudio.com/
1. Install RStudio from https://www.rstudio.com/products/rstudio/download/
1. In RStudio, install the `tidyverse` and `tinytex` packages

If you run into trouble, use (free) RStudio Cloud until you resolve the issue.

---

class: inverse, middle, center

# Toward a grammar of graphics

---

class: bg-full
background-image: url("data:image/png;base64,#assets/rosling_youtube.png")
background-position: center
background-size: contain

???

Source: https://www.youtube.com/watch?v=jbkSRLYSojo

---

class: bg-full
background-image: url("data:image/png;base64,#assets/rosling_youtube_zoom.png")
background-position: center
background-size: contain

???

Source: https://www.youtube.com/watch?v=jbkSRLYSojo

---

# Mapping attributes to aesthetics

Q: What is the **unit of observation**?

A: A country in a year

Q: How are a country-year's **attributes** mapped to **aesthetic** components of the graphic?
<table>
 <thead>
  <tr> <th style="text-align:left;"> Attribute </th> <th style="text-align:left;"> Aesthetic </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> Income </td> <td style="text-align:left;"> Horizontal position (x) </td> </tr>
  <tr> <td style="text-align:left;"> Life expectancy </td> <td style="text-align:left;"> Vertical position (y) </td> </tr>
  <tr> <td style="text-align:left;"> Population </td> <td style="text-align:left;"> Size of point </td> </tr>
  <tr> <td style="text-align:left;"> Continent </td> <td style="text-align:left;"> Color of point </td> </tr>
</tbody>
</table>

---

class: bg-full
background-image: url("data:image/png;base64,#assets/Minard.png")
background-position: center
background-size: contain

### Minard's graphic on Napoléon in Russia

???

One of the "best statistical drawings ever created" (Tufte, *VDQI*)

Source: [Wikipedia](https://en.wikipedia.org/wiki/File:Minard.png)

---

# Mapping attributes to aesthetics

Q: What is the **unit of observation**?

A: An army (army division) on a day ("army-day")

Q: How are an army-day's **attributes** mapped to **aesthetic** components of the graphic?

<table>
 <thead>
  <tr> <th style="text-align:left;"> Attribute </th> <th style="text-align:left;"> Aesthetic </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> Longitude </td> <td style="text-align:left;"> Horizontal position (x) </td> </tr>
  <tr> <td style="text-align:left;"> Latitude </td> <td style="text-align:left;"> Vertical position (y) </td> </tr>
  <tr> <td style="text-align:left;"> Number of surviving soldiers </td> <td style="text-align:left;"> Width of line </td> </tr>
  <tr> <td style="text-align:left;"> Direction (advance, retreat) </td> <td style="text-align:left;"> Color of line </td> </tr>
</tbody>
</table>

(Also note the secondary plot showing temperature during the retreat.)

---

# Data: structure

Our data is typically **rectangular**, with rows and columns like a spreadsheet.

--

Usually,

- each row should be one observation (e.g. country-year, army-day)
- each column should contain one attribute (e.g. life expectancy, number of surviving troops)

--

For example:

<table class="table" style="font-size: 12px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:left;"> country </th> <th style="text-align:left;"> continent </th> <th style="text-align:right;"> lifeExp </th> <th style="text-align:right;"> pop </th> <th style="text-align:right;"> gdpPercap </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> Asia </td> <td style="text-align:right;"> 43.828 </td> <td style="text-align:right;"> 31889923 </td> <td style="text-align:right;"> 974.5803 </td> </tr>
  <tr> <td style="text-align:left;"> Albania </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 76.423 </td> <td style="text-align:right;"> 3600523 </td> <td style="text-align:right;"> 5937.0295 </td> </tr>
  <tr> <td style="text-align:left;"> Algeria </td> <td style="text-align:left;"> Africa </td> <td style="text-align:right;"> 72.301 </td> <td style="text-align:right;"> 33333216 </td> <td style="text-align:right;"> 6223.3675 </td> </tr>
  <tr> <td style="text-align:left;"> Angola </td> <td style="text-align:left;"> Africa </td> <td style="text-align:right;"> 42.731 </td> <td style="text-align:right;"> 12420476 </td> <td style="text-align:right;"> 4797.2313 </td> </tr>
  <tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 75.320 </td> <td style="text-align:right;"> 40301927 </td> <td style="text-align:right;"> 12779.3796 </td> </tr>
  <tr> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> Oceania </td> <td style="text-align:right;"> 81.235 </td> <td style="text-align:right;"> 20434176 </td> <td style="text-align:right;"> 34435.3674 </td> </tr>
  <tr> <td style="text-align:left;"> Austria
</td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 79.829 </td> <td style="text-align:right;"> 8199783 </td> <td style="text-align:right;"> 36126.4927 </td> </tr>
  <tr> <td style="text-align:left;"> Bahrain </td> <td style="text-align:left;"> Asia </td> <td style="text-align:right;"> 75.635 </td> <td style="text-align:right;"> 708573 </td> <td style="text-align:right;"> 29796.0483 </td> </tr>
  <tr> <td style="text-align:left;"> Bangladesh </td> <td style="text-align:left;"> Asia </td> <td style="text-align:right;"> 64.062 </td> <td style="text-align:right;"> 150448339 </td> <td style="text-align:right;"> 1391.2538 </td> </tr>
  <tr> <td style="text-align:left;"> Belgium </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 79.441 </td> <td style="text-align:right;"> 10392226 </td> <td style="text-align:right;"> 33692.6051 </td> </tr>
</tbody>
</table>

???

Data in this format is sometimes referred to as "tidy" [(Wickham 2014)](https://vita.had.co.nz/papers/tidy-data.pdf). I think this concept is useful as long as you recognize that the definition of "unit of observation" (and thus attribute/variable) depends on the purpose for which the data is being used.

---

class: inverse, middle, center

# Making (beautiful and informative) graphics

---

# Making graphics in `R`

We will use the `ggplot2` package, which is part of the `tidyverse` collection of packages.

Basic components of plotting with `ggplot`:

- data
- mapping of attributes (columns of data) to aesthetics
- geometric representations of data (`geom`s)

--

To get started:

- install the package: `install.packages("tidyverse")` [first time]
- load the package: `library(tidyverse)` [every time]

---

class: inverse, middle, center

# Quick detour: getting data into `R`

---

# Getting data into `R` (and `RStudio`)

An interactive option: the `Import Dataset` button in the `Environment` pane of `RStudio`

--

But note it's showing you the code it's using!

(Live coding example.)
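
---

## What import code can look like

The code behind the `Import Dataset` button is along the lines of the sketch below. This is a minimal illustration, not the exact code RStudio generates. It assumes the `readr` package is installed, and the data and the names `toy`, `toy_path`, and `toy_in` are made up for the example.

```r
library(readr)

# A tiny made-up dataset, written to a temporary file so the example runs anywhere
toy <- data.frame(country = c("A", "B", "C"),
                  lifeExp = c(70.1, 65.3, 80.2))
toy_path <- tempfile(fileext = ".csv")
write_csv(toy, toy_path)

# The step that matters: read_csv() takes the path to the file as its first argument
toy_in <- read_csv(toy_path, show_col_types = FALSE)
toy_in
```

In practice you would replace `toy_path` with the path to your own file.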

---

# Getting data into `R` (cont'd)

Most commonly used functions:

- `read_csv()` and `read_rds()` in `readr` (`tidyverse`)
- `readstata13::read.dta13()` for Stata files (`.dta`)
- `readxl::read_excel()` for Excel files (`.xls`, `.xlsx`)
- `load()` in base R for "R objects"

All require the path to the file as an argument.

--

Sometimes data comes in a package, e.g. `babynames`, `gapminder`, `vdemdata`

--

See Chapter 11 of R4DS and the "Data Import" cheatsheet.

---

## A simple example

Get data in:

```r
gapminder_2007 <- gapminder::gapminder |>
  filter(year == 2007 & continent != "Oceania") # will cover later
```

Make a plot:

```r
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
```

---

```r
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
```

<img src="data:image/png;base64,#slides_iqss_week_1_pt1_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

To note:

- the **arguments** to `ggplot()` say what the data is (`data = gapminder_2007`) and how attributes are mapped to aesthetics (`mapping = aes(x = gdpPercap, y = lifeExp)`)
- `geom_point()` says "plot a point for each observation"
- **layers** of the plot are linked with the plus sign (`+`)

---

<!-- Let's map population to the size of the points: -->

```r
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp,
*                     size = pop)) +
  geom_point()
```

<img src="data:image/png;base64,#slides_iqss_week_1_pt1_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

<!-- Let's map continent to the color of the points: -->

```r
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp,
                     size = pop,
*                     col = continent)) +
  geom_point()
```

<img src="data:image/png;base64,#slides_iqss_week_1_pt1_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

<!-- Let's put the x-axis on the log scale: -->

```r
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp,
                     size = pop,
                     col = continent)) +
  geom_point() +
*  scale_x_log10()
```

<img src="data:image/png;base64,#slides_iqss_week_1_pt1_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---

# Minard data

<table class="table" style="font-size: 15px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:right;"> long </th> <th style="text-align:right;"> lat </th> <th style="text-align:right;"> survivors </th> <th style="text-align:left;"> direction </th> <th style="text-align:right;"> group </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:right;"> 24.0 </td> <td style="text-align:right;"> 54.9 </td> <td style="text-align:right;"> 340000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr>
  <tr> <td style="text-align:right;"> 24.5 </td> <td style="text-align:right;"> 55.0 </td> <td style="text-align:right;"> 340000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr>
  <tr> <td style="text-align:right;"> 25.5 </td> <td style="text-align:right;"> 54.5 </td> <td style="text-align:right;"> 340000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr>
  <tr> <td style="text-align:right;"> 26.0 </td> <td style="text-align:right;"> 54.7 </td> <td style="text-align:right;"> 320000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr>
  <tr> <td style="text-align:right;"> 27.0 </td> <td style="text-align:right;"> 54.8 </td> <td style="text-align:right;"> 300000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr>
  <tr> <td style="text-align:right;"> 28.0 </td> <td style="text-align:right;"> 54.9 </td> <td style="text-align:right;"> 280000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr>
  <tr> <td style="text-align:right;"> 28.5 </td> <td style="text-align:right;"> 55.0 </td> <td style="text-align:right;"> 240000 </td> <td
style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 29.0 </td> <td style="text-align:right;"> 55.1 </td> <td style="text-align:right;"> 210000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 30.0 </td> <td style="text-align:right;"> 55.2 </td> <td style="text-align:right;"> 180000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 30.3 </td> <td style="text-align:right;"> 55.3 </td> <td style="text-align:right;"> 175000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 32.0 </td> <td style="text-align:right;"> 54.8 </td> <td style="text-align:right;"> 145000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 33.2 </td> <td style="text-align:right;"> 54.9 </td> <td style="text-align:right;"> 140000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 34.4 </td> <td style="text-align:right;"> 55.5 </td> <td style="text-align:right;"> 127100 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 35.5 </td> <td style="text-align:right;"> 55.4 </td> <td style="text-align:right;"> 100000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 36.0 </td> <td style="text-align:right;"> 55.5 </td> <td style="text-align:right;"> 100000 </td> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 37.6 </td> <td style="text-align:right;"> 55.8 </td> <td style="text-align:right;"> 100000 </td> <td style="text-align:left;"> A </td> <td 
style="text-align:right;"> 1 </td> </tr>
  <tr> <td style="text-align:right;"> 37.7 </td> <td style="text-align:right;"> 55.7 </td> <td style="text-align:right;"> 100000 </td> <td style="text-align:left;"> R </td> <td style="text-align:right;"> 1 </td> </tr>
  <tr> <td style="text-align:right;"> 37.5 </td> <td style="text-align:right;"> 55.7 </td> <td style="text-align:right;"> 98000 </td> <td style="text-align:left;"> R </td> <td style="text-align:right;"> 1 </td> </tr>
</tbody>
</table>

---

```r
ggplot(data = minard,
       mapping = aes(x = long, y = lat,
                     size = survivors,
                     col = direction,
                     group = group)) +
*  geom_path()
```

<img src="data:image/png;base64,#slides_iqss_week_1_pt1_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

???

"`geom_path()` connects the observations in the order in which they appear in the data. `geom_line()` connects them in order of the variable on the x axis. `geom_step()` creates a stairstep plot, highlighting exactly when changes occur."

Source: https://ggplot2.tidyverse.org/reference/geom_path.html

---

# Back to `gapminder`

```r
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp,
                     size = pop,
                     col = continent)) +
  geom_point() +
  scale_x_log10()
```

<img src="data:image/png;base64,#slides_iqss_week_1_pt1_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

# Adding a smoothing line

`geom_smooth()` adds a "smoother". Let's try adding it!

--

```r
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp,
                     size = pop,
                     col = continent)) +
  geom_point() +
  scale_x_log10() +
*  geom_smooth()
```

<img src="data:image/png;base64,#slides_iqss_week_1_pt1_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

Hmm.

???

If you don't exclude Oceania, `ggplot` refuses to make a smoother using default settings because there are too few countries in Oceania.
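
To see how few countries that is, one could tabulate the 2007 data by continent. A rough sketch, assuming the `gapminder` package is installed; the name `gapminder_2007_all` is made up here to avoid clashing with the filtered data used in the slides.

```r
library(gapminder)

# Keep 2007 but, unlike in the slides, do not drop Oceania
gapminder_2007_all <- subset(gapminder, year == 2007)

# Number of countries (rows) per continent
table(gapminder_2007_all$continent)
```

In this dataset Oceania contains only Australia and New Zealand.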

---

# Inheritance of aesthetics

```r
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp, size = pop)) +
*  geom_point(aes(col = continent)) +
  scale_x_log10()
```

<img src="data:image/png;base64,#slides_iqss_week_1_pt1_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

---

## Data summary w. `geom_smooth()`

```r
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp, size = pop)) +
  geom_point(aes(col = continent)) +
  scale_x_log10() +
*  geom_smooth()
```

<img src="data:image/png;base64,#slides_iqss_week_1_pt1_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

---

## Linear version w. `geom_smooth()`

```r
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp, size = pop)) +
  geom_point(aes(col = continent)) +
  scale_x_log10() +
*  geom_smooth(method = lm) # lm means "linear model"
```

<img src="data:image/png;base64,#slides_iqss_week_1_pt1_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />

???

Anything after # (on the same line) is a "comment" and is ignored by R. This is useful for explaining to humans what is going on in the code.
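
---

## Checking what a layer inherits

One way to see inheritance at work is to count how many groups `geom_smooth()` ends up drawing. The following is a sketch, assuming `ggplot2` and `gapminder` are installed. The names `p_global`, `p_local`, and `n_groups()` are made up for this illustration, and `size = pop` is left out to keep it short.

```r
library(ggplot2)

gapminder_2007 <- subset(gapminder::gapminder,
                         year == 2007 & continent != "Oceania")

# Color mapped in ggplot(): every layer inherits it, including the smoother
p_global <- ggplot(gapminder_2007,
                   aes(x = gdpPercap, y = lifeExp, col = continent)) +
  geom_point() +
  scale_x_log10() +
  geom_smooth(method = "lm")

# Color mapped in geom_point() only: the smoother does not see it
p_local <- ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(col = continent)) +
  scale_x_log10() +
  geom_smooth(method = "lm")

# Hypothetical helper: number of separate groups drawn by layer i
n_groups <- function(p, i) length(unique(layer_data(p, i)$group))

n_groups(p_global, 2)  # one fit per continent
n_groups(p_local, 2)   # a single overall fit
```

Aesthetics set in `ggplot()` apply to every layer; aesthetics set inside a `geom` apply only to that layer.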

---

## Small multiples: `facet_wrap()`

```r
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp, size = pop)) +
  geom_point() +
  scale_x_log10() +
  geom_smooth(method = lm) +
*  facet_wrap(vars(continent))
```

<img src="data:image/png;base64,#slides_iqss_week_1_pt1_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---

# Other geoms

```r
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp)) +
*  geom_density2d() +
  scale_x_log10()
```

<img src="data:image/png;base64,#slides_iqss_week_1_pt1_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

---

# Other geoms

```r
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap, y = lifeExp)) +
*  geom_text(aes(label = country)) +
  scale_x_log10()
```

<img src="data:image/png;base64,#slides_iqss_week_1_pt1_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

---

# How to learn more about visualization

Practice and experiment. (And do problem sets.)

Resources:

- *R For Data Science*
- RStudio primers
- RStudio "Data Visualization" cheat sheet
- Google
- StackOverflow
- Readings on syllabus (Kieran Healy)

---

class: bg-full
background-image: url("data:image/png;base64,#assets/data_viz_cheatsheet.png")
background-position: center
background-size: contain

???

Source: [RStudio Cheatsheets](https://www.rstudio.com/resources/cheatsheets/)

---

# Back to the big picture

Components of a `ggplot`:

- data, with observations in rows, attributes in columns
- mapping of attributes to aesthetics (x, y, size, shape, color, transparency, etc.)
- geometric objects (`geom`s)

Next: getting data into the right format for plotting (and analysis).